ML - Random Forests and Gradient Boosting


MGMT 675
AI-Assisted Financial Analysis
Kerry Back

Outline

  • Decision tree
  • Random forest and gradient boosting
  • Shapley values
  • House price application

Decision tree

  • Split the dataset successively into subsets. Within each subset, the prediction is \(\hat y=\) mean of the target in the subset. Calculate the MSE.
  • Split on a single variable being above or below a threshold.
  • Choose variable and threshold so that MSE will be as small as possible after the split.

  • After each split, make further splits of all of the new subsets into even smaller subsets, for a specified number of times (# splits = depth).
  • The prediction for any observation is the mean target value in its final group (leaf).

Example

  • Ask Julius to read ml1.xlsx.
  • Ask Julius to fit a decision tree regressor with y1 as the target using all of the data as training data. Ask Julius to plot the tree.
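As a rough sketch of what Julius would run behind the scenes, the following fits a scikit-learn decision tree regressor and prints the tree's split rules. The data here is synthetic, standing in for ml1.xlsx, and the feature names are hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in for ml1.xlsx: three features, one target y1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y1 = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# Fit on all of the data (no train/test split, as in the exercise)
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, y1)

# Text version of the tree plot: each line is a split rule or a leaf mean
print(export_text(tree, feature_names=["x1", "x2", "x3"]))
```

Each leaf in the printed tree reports the mean target value of its subset, which is the prediction for any observation landing there.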

Random forest and gradient boosting

Random Forest

  • Generate random datasets of the same size as the original.
  • Create the random datasets by randomly drawing rows from the original with replacement (bootstrap sampling).
  • Fit a decision tree to each random dataset.
  • The prediction for any observation is the average of the predictions of the various trees.

  • Randomization helps to avoid overfitting.
  • Also control overfitting through:
    • max_depth = maximum depth of each tree (the number of successive splits along any path from the root to a leaf)
    • max_features = number of features to look at when deciding how to split (a subset of features of that size is randomly chosen for each split)
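The steps above can be sketched with scikit-learn's RandomForestRegressor, again on synthetic stand-in data. The check at the end confirms the forest's prediction is the average of its individual trees' predictions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X[:, 0] - X[:, 2] + rng.normal(scale=0.1, size=300)

# n_estimators bootstrap trees; max_depth and max_features control overfitting
rf = RandomForestRegressor(
    n_estimators=100, max_depth=10, max_features="sqrt", random_state=0
)
rf.fit(X, y)

# Forest prediction = average of the individual trees' predictions
forest_pred = rf.predict(X[:1])[0]
tree_avg = np.mean([t.predict(X[:1])[0] for t in rf.estimators_])
print(forest_pred, tree_avg)
```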

Gradient Boosting

  • Fit a decision tree.
  • Look at its errors. Fit a new decision tree to predict the errors.
  • The new prediction is the original prediction plus a fraction of the predicted error (fraction = learning rate).
  • Look at the errors of the new predictions. Fit a new decision tree to predict these errors.
  • Continue …
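The boosting loop above can be written out by hand in a few lines: start from the mean, repeatedly fit a small tree to the current errors, and add a fraction of its prediction. This is a minimal sketch on synthetic data, not a production implementation (scikit-learn's GradientBoostingRegressor does this internally):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
pred = np.full_like(y, y.mean())          # start from the overall mean
for _ in range(50):
    resid = y - pred                      # errors of the current prediction
    stump = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, resid)
    pred += learning_rate * stump.predict(X)   # add a fraction of the fit

# MSE shrinks as boosting rounds accumulate
print(np.mean((y - pred) ** 2))
```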

Examples

  • Ask Julius to train and test a random forest regressor to predict y1 in ml1.xlsx.
  • Ask Julius to use GridSearchCV to find the best max_depth in (5, 10, 15, 20).
  • Ask Julius to train and test a gradient boosting regressor to predict y1 in ml1.xlsx.
  • Ask Julius to use GridSearchCV to find the best learning rate in (0.001, 0.005, 0.01, 0.05, 0.1).
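A sketch of the GridSearchCV step, using the max_depth grid from the first bullet on synthetic stand-in data (the same pattern works for the gradient boosting learning-rate grid):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Cross-validate over the candidate depths on the training data
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"max_depth": [5, 10, 15, 20]},
    cv=5,
)
grid.fit(X_tr, y_tr)

# Best depth found, and out-of-sample R-squared of the refit model
print(grid.best_params_, grid.score(X_te, y_te))
```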

Interpreting Models: Shapley Values

  • The Shapley value for a feature at an observation is a measure of how much that feature contributed to the prediction at that observation.
  • A Shapley summary is a bar chart showing each feature’s mean absolute contribution (mean taken across observations).
  • A Shapley scatter plot for a feature plots all of the observations with the feature’s value on the x axis and the feature’s contribution to the prediction on the y axis.

  • Ask Julius to create a summary plot of the Shapley values for the random forest regressor with the best max_depth.
  • Ask Julius to create a scatter plot of the Shapley values for the x1 feature.
  • Ask Julius to create a scatter plot of the Shapley values for another feature.

House Price Application (TBD)